The Lancet Digital Health
Elsevier BV
Preprints posted in the last 90 days, ranked by how well they match The Lancet Digital Health's content profile, based on 25 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.
Fisher, G. R.
In previous work, we achieved state-of-the-art performance on ChestX-ray14 (ROC-AUC 0.940, F1 0.821) using pretraining diversity and clinical metric optimization. Applying the same methodology to CheXpert, we obtained similar results when evaluating against NLP-derived validation and test labels--but when evaluated against expert radiologist labels, performance was only 0.75-0.87 ROC-AUC. The models had learned to match the automated NLP labeling system, not to diagnose disease. This paper documents our investigation into this failure and our suggested resolution. We identify the NLP-to-expert generalization gap: a systematic divergence between models optimized on labels extracted from radiology reports and their agreement with board-certified radiologists. More surprisingly, we discovered that directly optimizing for small expert-labeled validation sets can be counterproductive--models with lower validation scores often generalized better to held-out expert test data. Four findings emerged. First, expert-labeled images for at least the validation and test sets, even if not for training, were vital in revealing the gap between NLP agreement and diagnostic accuracy. Without them, our models appeared excellent while failing to generalize to clinical judgment. Second, less training is better. Short training (1-5 epochs) outperformed extended training (60+ epochs) because longer training does not improve the model--it memorizes the labeler's mistakes. Third, ImageNet features are sufficient. Freezing the pretrained backbone and training only the classifier achieved 0.891 ROC-AUC--matching models with full fine-tuning. The rapid convergence we observed was not the model learning chest X-ray features; it was the classifier calibrating to already-sufficient visual representations. Fourth, regularization beats optimization. Label smoothing and frozen backbones--methods that prevent overfitting--outperformed direct metric optimization on small validation sets. 
The 200 expert-labeled validation images in CheXpert are too few to optimize directly; they are better used as a compass than a target. With these insights, we improved from 0.823 to 0.917 ROC-AUC, exceeding Stanford's official baseline (0.907).
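The label-smoothing regularization this abstract credits can be sketched generically. The function below is an illustrative stand-in, not the authors' implementation; the smoothing factor `eps=0.1` is an assumed default.

```python
def smooth_labels(one_hot, eps=0.1):
    """Blend a one-hot target with the uniform distribution over K classes.

    The true class keeps 1 - eps + eps/K of the probability mass and every
    other class receives eps/K, which discourages over-confident fits to
    noisy NLP-derived labels.
    """
    k = len(one_hot)
    return [(1 - eps) * y + eps / k for y in one_hot]
```

For a 4-class one-hot target `[0, 1, 0, 0]` with `eps=0.1`, the true class is reduced to 0.925 and each other class receives 0.025, so the target still sums to 1.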
Farquhar, H.
Background: Foundation models have emerged as a promising paradigm for medical imaging AI [7], with claims of improved generalization and reduced bias. However, their robustness to technical acquisition parameters remains unexplored. We evaluated whether foundation models exhibit greater robustness to chest radiograph view type (anteroposterior [AP] versus posteroanterior [PA]) compared to traditional convolutional neural networks. Methods: We compared four model architectures on the RSNA Pneumonia Detection Challenge dataset (n=26,684 images) and externally validated on the NIH ChestX-ray14 dataset (n=112,120 images): DenseNet-121 (supervised CNN), BiomedCLIP (vision-language model trained on 15 million biomedical image-text pairs), RAD-DINO (self-supervised model trained on 5+ million radiographs), and CheXzero (vision-language model trained on MIMIC-CXR chest radiographs). The primary outcome was the sensitivity gap between AP and PA views, with bootstrap confidence intervals and permutation testing. Results: On RSNA, CheXzero showed the smallest gap (14.3%, 95% CI: 11.2-17.5%), followed by RAD-DINO (25.2%, 22.6-27.9%), DenseNet-121 (35.7%, 32.9-38.7%), and BiomedCLIP (36.1%, 33.5-39.0%). However, on external validation (NIH), model rankings reversed completely: RAD-DINO demonstrated the smallest gap (22.3%, 95% CI: 21.0-23.6%), while CheXzero's gap increased dramatically to 48.9% (95% CI: 47.7-50.1%). Domain-specific training provided robustness within the training domain but failed to generalize. Among PA-view pneumonia cases in NIH, 31% were missed by all four models, representing a systematic blind spot. View type explained 61-100% of performance variance across models on both datasets, compared to 0-38% for age and less than 4% for sex. Conclusions: Foundation models do not eliminate technical acquisition parameter biases in chest X-ray AI. 
While domain-specific training (CheXzero) provided superior robustness on internal validation, this advantage collapsed on external data. Self-supervised learning (RAD-DINO) demonstrated the most generalizable robustness, with consistent view-type gap stability across datasets with different labeling schemes (25.2% to 22.3%, despite substantial AUC differences). These findings challenge assumptions about foundation model generalization and highlight the need for acquisition parameter auditing in AI regulatory frameworks and multi-site external validation for robustness claims.
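A percentile-bootstrap confidence interval for an AP-minus-PA sensitivity gap, of the kind reported above, can be sketched as follows. This is a generic illustration, not the study's code; `ap_hits`/`pa_hits` are assumed to be 0/1 indicators of whether each confirmed-positive case was detected.

```python
import random

def bootstrap_gap_ci(ap_hits, pa_hits, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AP-minus-PA sensitivity gap.

    ap_hits / pa_hits: 1 if the model detected a confirmed-positive case
    imaged in that view, else 0 (sensitivity = mean of the hits).
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        # Resample each view's cases with replacement, recompute the gap.
        ap_s = [rng.choice(ap_hits) for _ in ap_hits]
        pa_s = [rng.choice(pa_hits) for _ in pa_hits]
        gaps.append(sum(ap_s) / len(ap_s) - sum(pa_s) / len(pa_s))
    gaps.sort()
    lo = gaps[int(n_boot * alpha / 2)]
    hi = gaps[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

With, say, 90/100 AP cases detected versus 60/100 PA cases, the interval brackets the 30-point observed gap and excludes zero.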
Quill, S.; Hingorani, A. D.; Chaturvedi, N.; Schmidt, A. F.
Background: Population cancer screening detects the presence of early-stage disease rather than assessing future disease risk. We evaluated whether widely implemented cardiovascular disease (CVD) risk models can predict 10-year cancer risk and compared them with a less widely used cancer risk model (QCancer). Methods: We evaluated four CVD prediction models: QRISK3, the Pooled Cohort Equations (PCE), SCORE2, and SCORE2-OP. All models were recalibrated using 20% of the UK Biobank (UKB) cohort and tested in the remainder, as well as in the Clinical Practice Research Datalink (CPRD). We gauged model performance using c-statistics for discrimination and evaluated the fidelity of calibration. We also identified the most influential risk factors in the QRISK3 model. Findings: In the UKB test set, the c-statistics for incident CVD ranged from 0·71 to 0·74 (11,022 events). All CVD models achieved a c-statistic of 0·63 for any cancer (23,010 events) and showed CVD-equivalent discrimination for gastro-oesophageal, liver and biliary tree, laryngeal, renal tract, and lung cancers (c-statistic range: 0·70-0·81). Overall, the discrimination of the CVD models was comparable to that of the QCancer models (median difference in c-statistic: -0·01, 95% CI -0·03 to 0·00). The recalibrated CVD models showed near-perfect calibration (median intercept 0·01, Q1-Q3 -0·05 to 0·03; slope 1·00, Q1-Q3 0·93 to 1·15). Performance in CPRD (393,658 cancer events) was similar: the median c-statistic, calibration intercept, and slope were 0·01 (95% CI 0·00-0·02), 0·05 (95% CI 0·02-0·17), and 0·04 (95% CI 0·01-0·15) higher, respectively, in CPRD than in UKB. After age, smoking status and systolic blood pressure were the most influential predictors of cancer risk. 
Interpretation: Widely implemented CVD prediction models perform similarly to the QCancer models in the prediction of incident cancers. They may be used to inform cancer prevention and guide risk-stratified monitoring. The recalibrated models are available through an API. Funding: Health Data Research UK, British Heart Foundation, and UK Research and Innovation.
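The c-statistic used above for discrimination reduces, for binary outcomes, to the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case. A minimal sketch (brute-force O(n²), for illustration only, not the study's survival-adjusted estimator):

```python
def c_statistic(risks, events):
    """Concordance for binary outcomes: among all (case, non-case) pairs,
    the fraction where the case was assigned the higher risk (ties count half).
    """
    cases = [r for r, e in zip(risks, events) if e == 1]
    controls = [r for r, e in zip(risks, events) if e == 0]
    conc = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                conc += 1.0
            elif c == n:
                conc += 0.5
    return conc / (len(cases) * len(controls))
```

A value of 0.5 means the risks are no better than chance; 1.0 means every case outranks every non-case.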
Mitsuyama, Y.; Walston, S. L.; Takita, H.; Saito, K.; Ueda, D.
Purpose: To evaluate whether chest radiograph-derived age acceleration is associated with incident lung cancer and whether it improves discrimination beyond established lung cancer risk factors. Materials and Methods: This retrospective analysis used prospectively collected data from the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial. Baseline digitized chest radiographs from the initial screening year were analyzed using a previously validated deep learning model that estimates chest radiograph-derived age (Xp-age). Age acceleration (AgeAccel) was defined as the residual of Xp-age after calibration to chronological age using a regression model from the development dataset. A 1-year landmark design excluded participants diagnosed with lung cancer or censored within 1 year of baseline. Associations with incident lung cancer were assessed using multivariable Cox proportional-hazards models adjusted for prespecified demographic and clinical predictors, including the smoking variables used in the PLCOm2012 risk prediction model. Discrimination was evaluated using the concordance index and the 6-year time-dependent area under the receiver-operating-characteristic curve. Results: The analytic cohort included 23,213 participants (mean age, 62.5 years); 790 developed incident lung cancer after the landmark (mean follow-up, 16.7 years). Higher AgeAccel was associated with increased lung cancer incidence (hazard ratio, 1.10 per 1-SD increase; 95% confidence interval: 1.03-1.17); however, adding AgeAccel to an established risk factor model resulted in minimal change in discrimination (C-index, 0.840 vs. 0.839; time-dependent AUC at 6 years, 0.852 vs. 0.852). Attribution maps emphasized the aortic arch/mediastinal region, with similar spatial patterns across smoking and lung cancer strata. Conclusion: Chest radiograph-derived age acceleration was independently associated with future lung cancer incidence but added little incremental discrimination beyond established risk factors.
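The residual definition of age acceleration above can be illustrated with a toy calibration: fit model-estimated age against chronological age by ordinary least squares, then take the residual. This is a generic sketch, not the study's model; `fit_line` and `age_accel` are hypothetical helper names.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def age_accel(chrono_age, xp_age, a, b):
    """Residual of model-estimated age after calibration to chronological age:
    positive values mean the radiograph 'looks older' than expected."""
    return xp_age - (a + b * chrono_age)
```

With a calibration set where Xp-age tracks chronological age plus 2 years, a 65-year-old with Xp-age 70 has an AgeAccel of +3.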
Tan, J.; Tang, P. H.
Background: Paediatric pneumonia is a leading cause of childhood morbidity and mortality worldwide. Chest X-rays (CXRs) are an important tool in the diagnosis of pneumonia, but shortages in specialist radiology services lead to clinically significant delays in CXR reporting. The ability to communicate findings to both clinicians and laypersons allows multimodal large language models (MLLMs) to be deployed throughout clinical workflows, from image analysis to patient communication. However, MLLMs currently underperform state-of-the-art deep learning classifiers. Objective: To evaluate the diagnostic accuracy of ensemble strategies with MLLMs compared to the baseline average agent for paediatric radiological pneumonia detection. Methods: We conducted a retrospective cohort study using paediatric CXRs from two independent hospital datasets totalling 2300 CXRs. Fifteen MedGemma-4B-it agents independently classified each CXR into five pneumonia likelihood categories. Majority voting, soft voting, and GPT-OSS-20B aggregation were compared against average agent performance. The primary metric was OvR AUROC. Secondary metrics included accuracy, sensitivity, specificity, F1-score, Cohen's kappa, and OvO AUROC. Results: Soft voting achieved improvements in OvR AUROC (p_balanced = 0.0002, p_real-world = 0.0003), accuracy (p_balanced = 0.0008, p_real-world < 0.0001), Cohen's kappa (p_balanced = 0.0006, p_real-world = 0.0054), and OvO AUROC (p_balanced < 0.0001, p_real-world = 0.0011) across both datasets, and a superior F1-score (p_balanced = 0.0028) for the balanced dataset. Conclusion: Soft voting enhances MedGemma's diagnostic discriminatory performance for paediatric radiological pneumonia detection. Our system enables privacy-preserving, near real-time clinical decision support with explainable outputs and has potential for integration into emergency departments. Its high specificity supports triage by flagging high-risk radiological pneumonia cases.
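The soft-versus-majority voting comparison above can be sketched generically. The functions below are illustrative, not the study's pipeline; they assume each agent returns either a hard class label or a per-class probability vector.

```python
from collections import Counter

def majority_vote(labels):
    """Most frequent categorical label across agents (ties -> first seen)."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(prob_dists):
    """Average per-class probabilities across agents, then take the argmax.

    Unlike majority voting, this lets one highly confident agent outweigh
    several barely-decided ones.
    """
    n = len(prob_dists)
    k = len(prob_dists[0])
    mean = [sum(d[i] for d in prob_dists) / n for i in range(k)]
    return max(range(k), key=mean.__getitem__)
```

For example, if two agents lean weakly to class 0 (0.55/0.45) and one is certain of class 1 (0.0/1.0), majority voting picks class 0 while soft voting picks class 1.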
Lettner, J. D.; Evrenoglou, T.; Binder, H.; Fichtner-Feigl, S.; Neubauer, C.; Ruess, D. A.
Background: AI-based radiomics has demonstrated promising diagnostic performance for pancreatic cystic neoplasms, yet clinical translation remains limited. Whether this reflects insufficient model performance or structural limitations of the evidence base remains unclear. Methods: We performed a systematic review and diagnostic test accuracy meta-analysis of AI-based radiomics in pancreatic cysts (2015-2025), addressing two clinically relevant tasks (Q1: cyst-type differentiation; Q2: malignancy or high-grade dysplasia prediction). Training and validation datasets were synthesized independently using hierarchical models. Study evaluation extended beyond diagnostic performance to a four-dimensional framework integrating RQS 2.0, METRICS, TRIPOD+AI, and PROBAST+AI, explicitly contrasting pooled diagnostic performance with reporting quality, methodological rigor, and risk of bias. The review was pre-registered (PROSPERO) and conducted according to PRISMA 2020. Results: Twenty-nine studies were included (Q1: n = 15; Q2: n = 14), predominantly retrospective and single-center. Training-based analyses showed high apparent diagnostic performance for Q1 (pooled sensitivity/specificity: 0.89 [95% CI, 0.85-0.92]/0.90 [0.85-0.93]), but there was substantial heterogeneity (τ² = 0.56/0.78; ρ = 0.38). Validation-based performance remained high (0.86 [0.82-0.89]/0.88 [0.81-0.93]), while heterogeneity persisted and prediction regions exceeded confidence regions. For Q2, training-based analyses demonstrated similarly high apparent performance (0.88 [0.79-0.95]/0.89 [0.81-0.94]), with pronounced heterogeneity (τ² = 1.98/1.61; ρ = 0.63). Validation-based performance was slightly lower, yet still clinically comparable (0.82 [0.75-0.89]/0.86 [0.80-0.91]), and heterogeneity persisted (τ² = 0.71/0.43; ρ = 0.15). 
Across both tasks, high diagnostic accuracy occurred alongside incomplete reporting, limited validation, and an elevated risk of bias. Conclusion: AI-based radiomics for pancreatic cysts has reached a structural performance plateau. Further improvements in diagnostic accuracy alone are insufficient to achieve clinical translation and must be accompanied by a paradigm shift from performance-driven model development toward decision-anchored study designs, robust validation strategies, transparent reporting standards, and clinically integrated evaluation frameworks. Summary: Although pancreatic cystic lesions are increasingly being detected, imaging-based decision-making remains limited, particularly regarding differentiating between cyst types and stratifying malignancy risk. In this PRISMA-compliant and PROSPERO-registered systematic review and diagnostic test accuracy meta-analysis, we evaluated the use of AI-based radiomics for these two tasks, as well as its contextualized performance. A four-dimensional framework was employed to conduct the evaluation, incorporating diagnostic accuracy, reporting quality, risk of bias, and radiomics maturity. Across studies published between 2015 and 2025, pooled diagnostic performance was consistently high, with only modest declines observed from the training to the validation stage. Nevertheless, considerable heterogeneity between studies and limited transportability remained evident. Multidimensional evaluation indicated a systematic dissociation between reported performance and methodological robustness, characterized by incomplete reporting, restricted validation, and an elevated risk of bias. These limitations were consistent across both clinical questions and were not resolved by increasing model complexity. The findings of this meta-analysis suggest that the structural performance of AI-based radiomics for pancreatic cysts has plateaued. 
To progress towards clinical translation, study designs anchored in decision-making processes, robust multi-center validation, and transparent, reproducible evaluation frameworks are needed, rather than further optimization of model architecture alone.
Vazquez, J.; Taylor, L.; Chen, Y.-Y. K.; Araya, K.; Farnsworth, M. G.; Xue, X.; Hasan, M.; N3C Consortium,
Predicting hospital outcomes for patients with severe acute respiratory infections is critical for risk stratification and resource planning, yet heterogeneous electronic health record (EHR) data, class imbalance, and evolving clinical practice present persistent methodological challenges for machine learning (ML) approaches. We conducted a retrospective cohort study using EHR data harmonized to the OMOP common data model from the National COVID Cohort Collaborative (N3C; May 2020-June 2025), including 263,619 adults hospitalized with COVID-19 across 51 contributing sites. We developed penalized linear regression (elastic net), random forest, XGBoost, and multilayer perceptron (MLP) models to predict hospital length of stay (LOS) and mortality (in-hospital and 60-day), using demographics, comorbidities, prior healthcare utilization, COVID-19 vaccination status, and hospital site as predictors. Missing data were handled via multiple imputation by chained equations (MICE), and class imbalance was addressed using the Synthetic Minority Oversampling Technique (SMOTE). Model performance was evaluated using area under the ROC curve (AUROC), Brier score, calibration plots, and decision curve analysis, following the TRIPOD reporting framework. Mortality prediction achieved moderate discrimination across all models (test AUROC = 0.71-0.73 for in-hospital mortality; 0.72-0.73 for 60-day all-cause mortality). Models trained without SMOTE achieved the highest AUROCs but assigned virtually no patients to the mortality class at the default 0.5 threshold. SMOTE improved recall and F1 score at the cost of reduced AUROC and precision. LOS was poorly explained by available structured predictors (best R² = 0.059). Remdesivir-treated patients (n = 103,536; 39.3%) were older, had a higher comorbidity burden, and had higher unadjusted mortality than untreated patients. Common structured EHR features offer moderate utility for mortality risk stratification in hospitalized COVID-19 patients but are insufficient for LOS prediction. 
The consistent SMOTE-related tradeoff between discrimination and calibration underscores the need to report threshold-dependent metrics alongside AUROC in clinical ML studies, with implications for operational planning during future respiratory disease emergencies.
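The SMOTE technique discussed above synthesizes minority-class points by interpolating between a minority point and one of its nearest minority neighbours. A minimal stdlib sketch with brute-force neighbour search and assumed parameter names, not the study's implementation:

```python
import random

def smote_sample(minority, k=2, n_new=3, seed=0):
    """Generate n_new synthetic minority points.

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours, at a
    random interpolation fraction lam in [0, 1).
    """
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist2(x, p))[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()
        synthetic.append(tuple(xi + lam * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic
```

Because each synthetic point is a convex combination of two existing minority points, it stays inside the minority region rather than duplicating existing records, which is what inflates recall while degrading calibration at the original threshold.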
Barlow, M.; Down, L.; Mounce, L.; Merriel, S. W. D.; Watson, J.; Martins, T. O.; Bailey, S. E.
Background: Platelet count and C-reactive protein (CRP) are blood tests commonly used in primary care as part of the diagnostic work-up for symptomatic patients. Abnormal results of these tests can indicate an undetected cancer; however, it is not known whether the association between an abnormal test result and cancer risk varies by patient ethnicity. Methods: This cohort study used routinely collected primary and secondary health care records in England with linkage to national cancer registry data. Included patients had a record of ethnicity, no prior malignancy, a platelet count or CRP record between 1st January 2010 and 31st December 2017, and were aged 40 years or over at the time of that test. Ethnicity was categorised as White, Asian, Black, Other, and Mixed. Multi-level logistic regression models estimated cancer incidence within one year of testing, adjusted for age, sex, comorbidities, BMI, deprivation, and year of test. Results: Among 4,948,342 patients with a platelet record and 811,559 with a CRP record, one-year cancer incidence was highest among White patients and lowest among Asian patients. Following a normal platelet count, cancer incidence was 1.3% (95% CI 1.3-1.3%) for White patients and 0.63% (0.60-0.66%) for Asian patients; following thrombocytosis, incidence increased to 4.1% (4.0-4.2%) and 1.8% (1.5-2.0%), respectively. After a normal CRP result, cancer incidence was 1.5% (1.4-1.5%) for White patients and 0.79% (0.71-0.88%) for Asian patients, rising to 3.6% (3.5-3.7%) and 1.9% (1.7-2.2%) for a high CRP result, respectively. No significant interactions were found between ethnicity, blood test result, and overall cancer diagnosis, and similar diagnostic odds ratios (dORs) were observed across all ethnic groups. However, for colorectal cancer, Black patients with abnormal results showed higher dORs than White patients, relative to a normal result. 
The dOR for thrombocytosis was 11.1 (7.8-15.6) for Black patients versus 5.7 (5.4-6.0) for White patients (interaction p<0.001), and for raised CRP was 4.1 (2.6-6.6) for Black patients versus 2.5 (2.3-2.7) for White patients (interaction p=0.043). Conclusion: This large primary care study underscores the need for ethnically diverse cohorts when evaluating diagnostic tests to avoid widening healthcare inequalities.
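The diagnostic odds ratios compared above come from a 2x2 table of test result against one-year cancer outcome. The function below is a generic raw-count sketch, not the study's model-adjusted estimate (which came from multi-level logistic regression); the argument names are illustrative.

```python
def diagnostic_odds_ratio(a, b, c, d):
    """dOR from a 2x2 table:
        a = abnormal test, cancer     b = abnormal test, no cancer
        c = normal test,   cancer     d = normal test,   no cancer
    dOR = (a/b) / (c/d): how much an abnormal result multiplies the
    odds of a cancer diagnosis relative to a normal result.
    """
    return (a / b) / (c / d)
```

For example, 20 cancers among 30 abnormal results versus 10 cancers among 50 normal results gives a dOR of (20/10)/(10/40) = 8.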
Makacha, L.; Makanga, P. T.; Tonne, C.; Volvert, M.-L.; Nunes, J.; Jah, H.; Sevene, E.; Mukhanya, M.; Koech, A.; Wanje, O.; Vala, A.; Mistry, H. D.; Sandhu, A.; Blencowe, H.; D'Alessandro, U.; Waiswa, A. J. N.; Temmerman, M.; Roca, A.; Bone, J. N.; Idris, Y.; Magee, L. A.; Barratt, B.; von Dadelszen, P.
Introduction: Ambient and indoor fine particle air pollution (PM2·5) estimates have been associated with pregnancy complications. We aimed to link direct personal exposure measurements with placenta-mediated pregnancy complications in three sub-Saharan African countries. Methods: We recruited a geographically and energy-use stratified sub-sample of 343 rural and urban women who had recently given birth in the PREgnancy Care Integrating translational Science, Everywhere (PRECISE) prospective pregnancy cohort in The Gambia (n = 160), Kenya (n = 105), and Mozambique (n = 78). Individual-level exposure to PM2·5 was assessed using high-resolution personal monitoring during both wet and dry seasons. Minute-level data were summarised as mean and peak daily PM2·5 concentrations and correlated with maternal blood pressure (BP), gestational age at delivery, fetal growth, and stillbirth in the index pregnancy. Results: 107/343 (32·2%) women experienced pregnancy hypertension, 57/343 (16·0%) delivered preterm, 203/304 (66·8%) infants with known birthweights were appropriately grown, and 9/343 (2·7%) infants were stillborn. Higher mean (p=0·012) and peak (p=0·007) exposures were associated with reduced fetal growth velocity, with greater mean exposure associated with small-for-gestational-age infants (p=0·016). Greater mean (p=0·017) and peak (p=0·045) PM2·5 exposures were associated with lower birthweight centile. No associations were observed with pregnancy hypertension, pregnancy duration, or stillbirth. Discussion: This study provides exploratory evidence that personal PM2·5 exposure is associated with impaired fetal growth in sub-Saharan Africa. 
Prioritising access to clean fuels, reducing emissions from informal transport and waste systems, and incorporating personal exposure monitoring into maternal health frameworks could yield measurable improvements in birth outcomes and health equity.
Boiardi, F. E.; Lain, A. D.; Posma, J. M.
Pneumonia detection in chest X-rays (CXRs) is complicated by high inter-observer variability and overlapping radiographic patterns. While deep learning (DL) solutions show promise, limitations in generalisability and explainability hinder clinical adoption. We address these challenges by introducing a holistic DL-based computer-aided diagnosis (CAD) pipeline for pneumonia detection, localisation, and structured report generation from CXRs. We curated the largest composite of publicly available CXRs to date (N=922,634), of which [Formula] were used for training. MIMIC-CXR radiology reports were relabelled using a local large language model (LLM), positing that LLM-derived pneumonia labels would yield higher diagnostic sensitivity than the provided rule-based natural language processing (rNLP) labels. DenseNet-121 classifiers were trained on four configurations: MIMIC-CXR (rNLP), MIMIC-CXR (LLM), and each supplemented with VinDr-CXR data. Gradient-weighted Class Activation Mapping (Grad-CAM) provided visual explainability and lung zone-based localisation. LLM-driven relabelling significantly improved human-label agreement (96.5% vs 72.5%, P=1.66×10⁻¹¹). The best-performing model (MIMIC-CXR (LLM) + VinDr-CXR) achieved 82.08% sensitivity and 81.97% precision, surpassing both radiologist sensitivity ranges (64-77.7%) and CheXNet's pneumonia F1-score (43.5%). Grad-CAM localisation attained a moderate F1-score of 52.9% (sensitivity=65.7%, precision=44.3%), confirming focus alignment with pathological lung regions while highlighting areas for refinement. These findings demonstrate that LLM-driven label curation, combined with DL, can exceed conventional rNLP and radiologist performance, advancing high-quality data integration in predictive medical imaging. Clinically, our pipeline offers rapid triage, automated report drafting, and real-time pneumonia surveillance: tools that can streamline radiology workflows and mitigate diagnostic errors.
Choi, H.; Bae, S.; Na, K. J.
Background: Although deep learning models have improved individual PET analysis, image processing, and quantification tasks, end-to-end automation from raw DICOM to quantitative clinical reporting remains limited, particularly in heterogeneous real-world settings. Methods: As a proof of concept, an autonomous large language model (LLM)-orchestrated multi-tool agent for end-to-end PET/CT interpretation was developed. A reasoning-based text LLM selected appropriate series from raw DICOM, coordinated registration and SUV conversion, invoked segmentation and detection tools, generated maximum-intensity projections, called a vision-enabled LLM for interpretation, and synthesized structured draft reports. The system was retrospectively evaluated in 170 patients undergoing baseline FDG PET/CT for lung cancer staging, using expert reports as reference. Results: The agent completed the full end-to-end workflow from raw DICOM selection to structured draft report generation without human intervention in all 170 examinations. Primary tumor detection achieved 100% sensitivity. For nodal involvement, sensitivity was 84.8% and specificity was 39.4%, whereas distant metastasis detection showed 70.2% sensitivity and 65.0% specificity. Discrepancy analysis of 58 nodal and 57 metastatic mismatch cases revealed systematic false-positive findings related to reactive or physiologic uptake and false-negative findings involving small-volume or anatomically atypical metastases. Conclusion: LLM-orchestrated PET/CT agents can enable workflow-level automation from raw DICOM to quantification and structured draft reporting under real-world conditions. Although primary tumor detection was highly reliable, nodal and metastatic assessment revealed systematic limitations, supporting a collaborative role with continued expert oversight in complex clinical scenarios.
Prestige, E.; Warren-Gash, C.; Quint, J. K.; Evans, D.; Costello, R. E.; Mehrkar, A.; Bacon, S.; Goldacre, B.; Barley-McMullen, S.; Yameen, F.; Shah, P.; Natt, M.; Alder, Y.; Hulme, W.; Parker, E. P. K.; Eggo, R. M.
Electronic health records (EHRs) are a rich source of data which can be used to analyse health outcomes using computable phenotypes. With the approval of NHS England, we used the OpenSAFELY secure analytics platform to design and assess phenotypes to classify three key respiratory viruses - respiratory syncytial virus (RSV), influenza, and COVID-19 - in English coded health data between September 2016 and August 2024. We compared specific and sensitive phenotypes to one another and to publicly available surveillance data. Cases from both phenotypes showed similar seasonal patterns to surveillance data. Sensitive phenotypes carried a higher risk of misclassification than specific phenotypes for mild cases. For severe cases, the risk of misclassification was higher in infants than in older adults, irrespective of the phenotype used. The phenotypes presented here offer a solution to classifying respiratory viruses from coded health records in the absence of testing information.
EL Moudden, I.; Bittner, M.; Dodani, S.
Background: Cardiovascular disease (CVD) readmissions impose a substantial clinical and economic burden. Machine learning (ML) may improve risk stratification, yet most predictive models aggregate CVD subtypes into a single outcome and underrepresent Black populations. Methods: Using Virginia Health Information database records (2010 to 2020), we analyzed 157,791 discharge records from 123,272 unique patients (96.6% Black) to develop condition-specific 30-day readmission models for heart failure (HF; n=91,752), acute myocardial infarction (AMI; n=34,497), atrial fibrillation/flutter (AF/AFL; n=18,424), and hypertensive heart disease (HHD; n=13,118). Four algorithms (XGBoost, LightGBM, Random Forest, Elastic Net) plus a Super Learner ensemble were trained on patient-grouped 70/30 splits, with and without Synthetic Minority Oversampling Technique balancing. Models incorporated validated clinical indices (LACE, Charlson, Elixhauser) and administrative social determinants of health proxies. Results: The overall 30-day readmission rate was 18.9%. Best area under the receiver operating characteristic curve (AUC) values by condition were HF 0.708 (95% CI, 0.701 to 0.716), AMI 0.706 (95% CI, 0.691 to 0.721), AF/AFL 0.732 (95% CI, 0.715 to 0.750), and HHD 0.758 (95% CI, 0.735 to 0.777). XGBoost was the top-performing algorithm for three of four subtypes. The LACE Index, Charlson Comorbidity Index, and insurance type were consistently the strongest predictors. Algorithm-native, aggregated, and SHAP-based importance measures converged on these key features. Conclusions: In this largest-to-date, predominantly Black statewide cohort, condition-specific ML models achieved moderate-to-high discrimination for HF, AMI, AF/AFL, and HHD. Key clinical indices and administrative social determinants proxies emerged as dominant predictors, highlighting modifiable targets and high-risk subgroups. 
These findings support the development of precision, equity-informed readmission interventions and provide a scalable framework for deploying ML-driven decision support in safety-net and minority-serving healthcare systems. WHAT IS KNOWN: * Machine learning models for cardiovascular readmission prediction have largely aggregated disease subtypes and underrepresented Black populations. * Most existing studies lack head-to-head algorithm comparisons within racially concentrated cohorts and omit social determinants of health proxies. WHAT THE STUDY ADDS: * Condition-specific models for four cardiovascular subtypes achieved moderate-to-high discrimination (AUC 0.690 to 0.706) in the largest machine learning-based analysis of a predominantly Black statewide cohort. * Validated clinical indices (LACE, Charlson) and insurance type consistently emerged as dominant predictors, identifying modifiable targets for equity-informed intervention. * The scalable, administrative-data-only framework supports deployment of subtype-specific readmission decision support in safety-net and minority-serving health systems.
Weber, M.; Fischer, C.
Withdrawal Statement: This article has been withdrawn by medRxiv because it was submitted with false information.
Whitfield, J. A.; Graves, E. M.
Withdrawal Statement: This article has been withdrawn by medRxiv because it was submitted with false information.
Niggemeier, L.; Hoelscher, D. L.; Herkens, T. C.; Gilles, P.; Boor, P.; Buelow, R.
IntroductionKidney biopsy reports contain rich information that is clinically actionable and useful for research. However, the narrative format hinders scalable reuse. We here investigated whether open-source large language models (LLMs) can extract relevant, standardized readouts from native kidney biopsy pathology reports. MethodsGerman free-text native kidney biopsy reports were parsed with three open-source LLMs (Llama3 70B, Llama3 8B, MedGemma) to generate structured JSON outputs covering relevant report elements (e.g., diagnosis, glomerular counts, histopathological patterns). Two independent observers manually curated the same report elements; disagreements between the two were resolved by an experienced nephropathologist to create the final ground truth. Performance was assessed using strict and soft matching and summarized accuracy. Inter-rated agreement was quantified using Cohens and Lights Kappa with 95% confidence intervals via 1000-times bootstrapping. ResultsLlama3 70B achieved the highest overall accuracy (93.3% strict, 97.1% soft), followed by MedGemma. These larger models showed near perfect performance for explicit and discrete variables and positivity of immunohistochemistry markers, while accuracy decreased for report elements requiring interpretation (e.g., primary diagnosis, interstitial inflammation in fibrosis vs. non-fibrotic cortex). Human raters showed strong agreement for the primary diagnosis ({kappa} = 0.74, 95% CI 0.64-0.84). Adding Llama3 70B or MedGemma as a third rater increased overall agreement (0.82, 95% CI 0.74-0.89 and 0.78, 95% CI 0.69-0.85, respectively), whereas Llama3 8B reduced it. ConclusionsOpen-source LLMs can accurately transform narrative nephropathology reports into a structured and machine-readable format, potentially supporting scalable retrospective cohort building. While some report elements can be extracted without supervision, interpretation-dependent elements should be supervised by a human observer. 
Lay Summary: Retrospective data collection from nephropathology reports is essential for building informative cohorts in computational nephropathology research, yet manual processing of narrative reports is time-consuming and limits scalability. In this study, we demonstrate that open-source large language models can reliably extract key diagnostic, quantitative, and descriptive data elements from kidney biopsy reports with high accuracy. While factual and clearly stated report elements can be extracted automatically, findings that require contextual or interpretative judgment still benefit from expert supervision. Overall, this approach substantially reduces manual effort and enables efficient generation of structured datasets from routine diagnostics, facilitating the development of kidney registries and future computational nephropathology research. In addition, such systems could be integrated into the routine diagnostic workflow to directly transform narrative reports into structured data.
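The agreement statistics quoted above (Cohen's kappa with percentile bootstrap confidence intervals over 1,000 resamples) can be sketched in a few lines. This is a minimal illustration; the rater labels in the usage note are invented, not the study's data:

```python
import random
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: expected matches from the marginal label frequencies.
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n ** 2
    if expected == 1:
        return 1.0  # degenerate case: both raters constant and identical
    return (observed - expected) / (1 - expected)

def bootstrap_ci(a, b, reps=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for kappa by resampling items."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohens_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    return stats[int(reps * alpha / 2)], stats[int(reps * (1 - alpha / 2))]
```

For example, `cohens_kappa([0, 0, 1, 1], [0, 0, 1, 1])` gives 1.0 (perfect agreement), while two raters who agree only as often as chance predicts get a kappa near 0. Light's kappa, used in the study for three raters, is simply the mean of the pairwise Cohen's kappas.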
Ertl, R.; Syngelaki, A.; Frank, O.; Lueftinger, L.; Lukacova, E.; Lumby, C.; Stuetz, A.; Beisken, S.; Posch, A. E.; Nicolaides, K. H.
Background: Preeclampsia, which is a leading cause of maternal and perinatal mortality and morbidity, represents a biologically heterogeneous syndrome. First-trimester screening with the Fetal Medicine Foundation (FMF) competing-risks model enables prevention of preterm preeclampsia through aspirin prophylaxis but depends on Doppler velocimetry and biochemical measurements that limit scalability and offer limited discrimination for term disease. A unified, molecular first-trimester test capable of stratifying risk across the full clinical spectrum of preeclampsia has not been established. Objective: To determine whether multi-modal, tissue-resolved analysis of first-trimester circulating cell-free DNA (cfDNA), obtained during routine non-invasive prenatal testing (NIPT), enables early prediction of both preterm and term preeclampsia. Study Design: This nested case-control study included 125 singleton pregnancies sampled at 11-14 weeks' gestation after quality control (48 controls, 30 preterm preeclampsia, 47 term preeclampsia). For 80 pregnancies, matched placental villi and maternal buffy coat samples were available to derive tissue reference profiles. Plasma cfDNA underwent multi-modal sequencing using Oxford Nanopore Technologies, enabling tissue-resolved analysis of fragmentomic and epigenetic signatures. Separate ensemble machine-learning classifiers were developed for preterm (<37 weeks) and term (≥37 weeks) preeclampsia using stratified 10-fold cross-validation. Model discrimination was evaluated using the area under the receiver operating characteristic curve (AUROC), sensitivity at predefined specificity thresholds, and comparison with the FMF first-trimester risk score. A population-level simulation of 100,000 pregnancies, applying incidence point estimates of 2.5% for preterm and 7.5% for term preeclampsia, was used to derive predictive values and likelihood ratios.
Results: The multi-modal cfDNA classifier achieved an AUROC (95% CI) of 0.85 (0.77-0.91) for preterm preeclampsia and 0.84 (0.76-0.91) for term preeclampsia. The FMF score yielded an AUROC of 0.80 (0.70-0.89) for preterm and 0.53 (0.43-0.63) for term preeclampsia. At 80% specificity, cfDNA sensitivity was 70.5% for preterm and 72.1% for term preeclampsia, demonstrating improved discrimination for term disease compared with FMF screening. In the simulated population-level analysis, positive likelihood ratios were 4.25 (preterm) and 3.83 (term), with negative likelihood ratios of 0.21 and 0.34, respectively, supporting meaningful post-test risk stratification and strong rule-out performance. Conclusion: First-trimester multi-modal, tissue-resolved cfDNA analysis enables early risk stratification across the full clinical spectrum of preeclampsia from a single routine blood sample. Compared with FMF screening, this approach can potentially improve discrimination for term preeclampsia while providing incremental improvement for preterm disease. The potential for integration into existing NIPT workflows offers a scalable pathway toward unified precision prevention, supporting timely aspirin prophylaxis for preterm preeclampsia and risk-adapted surveillance strategies for term disease.
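The likelihood ratios and post-test risk figures reported above follow standard screening arithmetic: LR+ = sensitivity / (1 - specificity), LR- = (1 - sensitivity) / specificity, and post-test probability via Bayes' rule on the odds scale. A minimal sketch, using illustrative operating points rather than the study's:

```python
def likelihood_ratios(sens, spec):
    """Positive and negative likelihood ratios from a test's operating point."""
    return sens / (1 - spec), (1 - sens) / spec

def post_test_prob(pretest, lr):
    """Update a pre-test probability with a likelihood ratio via odds."""
    odds = pretest / (1 - pretest)   # probability -> odds
    post_odds = odds * lr            # Bayes' rule on the odds scale
    return post_odds / (1 + post_odds)  # odds -> probability
```

For instance, a test with 80% sensitivity and 80% specificity has LR+ = 4.0; applied to a 2.5% pre-test incidence (the abstract's preterm point estimate), the post-test probability is about 9.3%. This is the kind of calculation the study's 100,000-pregnancy simulation performs at scale.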
Windisch, P.; Koechli, C.; Dennstaedt, F.; Aebersold, D. M.; Zwahlen, D. R.; Foerster, R.; Schroeder, C.
Purpose: Large language models (LLMs) can classify biomedical documents accurately, but strong performance does not prove they are using the supplied text rather than identifier-triggered parametric knowledge. We tested whether oncology trial-success classification reflects "reading" of abstract evidence or "remembering" of known trials. Methods: We used a corpus of 250 two-arm oncology randomized controlled trials from seven major journals (2005-2023) and asked the flagship models of three commercial vendors (OpenAI, Google, and Anthropic) to output a single label indicating whether the primary endpoint was met. For each trial we created five deterministic inputs: title+abstract (baseline), title-only, DOI-only, a counterfactual title+abstract with the primary-endpoint outcome minimally flipped, and the same counterfactual title+abstract paired with the original DOI to induce an identifier-text conflict. Results: With the full title+abstract, models achieved near-ceiling performance (accuracy and F1 score 0.96-0.97) and high format adherence (97.2-100%). Performance degraded stepwise with content removal (title-only accuracy and F1 score 0.79-0.88, DOI-only 0.63-0.67), consistent with an above-chance identifier-driven signal. Under counterfactual results, models followed the edited evidence (accuracy and F1 score 0.96-0.99 against inverted labels). Adding the real DOI minimally affected GPT (accuracy and F1 score ≈0.99) but modestly reduced Gemini (≈0.97) and Claude (≈0.95), mainly via lower sensitivity. Conclusion: LLMs robustly track explicit endpoint statements in abstracts, yet identifiers can support above-chance predictions and occasionally compete with textual evidence. Progressive ablations plus counterfactual conflicts provide a practical, reproducible audit of grounding in biomedical LLM evaluations.
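The five deterministic inputs described above lend themselves to a simple variant builder. A sketch, with hypothetical field names rather than the paper's actual code:

```python
def build_variants(trial):
    """Build the five ablation inputs for one trial record.

    `trial` is assumed to carry: title, abstract, doi, and cf_abstract
    (the counterfactual abstract with the primary-endpoint outcome flipped).
    """
    return {
        # Baseline: full textual evidence.
        "title_abstract": f"{trial['title']}\n{trial['abstract']}",
        # Ablations: progressively remove content, keep the identifier.
        "title_only": trial["title"],
        "doi_only": trial["doi"],
        # Counterfactual: evidence contradicts any memorized outcome.
        "counterfactual": f"{trial['title']}\n{trial['cf_abstract']}",
        # Conflict: real DOI paired with flipped evidence.
        "counterfactual_doi": (
            f"DOI: {trial['doi']}\n{trial['title']}\n{trial['cf_abstract']}"
        ),
    }
```

Because the variants are constructed deterministically from each record, the ablation is exactly reproducible, which is the property the audit relies on.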
Ytsma, C. R.; Torralbo, A.; Fitzpatrick, N. K.; Pietzner, M.; Louloudis, I.; Nguyen, D.; Ansarey, S.; Denaxas, S.
Objective: The aim of this study was to develop and validate an automated, scalable framework to harmonise fragmented UK primary care prescription records into a research-ready dataset by mapping four diverse medical ontologies to a unified, historically comprehensive reference standard. Materials and Methods: We used raw prescription records for consented participants in the UK Biobank, in which participants are uniquely characterized by multiple data modalities. Primary care data were preprocessed by selecting one drug code if multiple were recorded, cleaning codes to match reference presentations, expanding code granularity based on drug descriptions, and updating outdated codes to a single reference version. Harmonisation entailed mapping British National Formulary (BNF) and Read2 codes to dm+d, the universal NHS standard vocabulary for uniquely identifying and prescribing medicines. Harmonised dm+d records were then homogenised to a single concept granularity, the Virtual Medicinal Product (VMP). We validated our methods by creating medication profiles mapping contemporary drug prescribing patterns in 312 physical and mental health conditions. Results: We preprocessed 57,659,844 records (100%) from 221,868 participants (100%). Of those, 48,950 records were dropped for lack of a drug code. 7,357,572 records (13%) used multiple ontologies. Most records (76%) were encoded in BNF, and most had their code granularity expanded via the drug description (N=28,034,282; 49%). 41,244,315 records (72%) were harmonised to dm+d, and 99.98% of these were converted to VMP as a homogeneous dataset. Across 312 diseases, we identified 23,352 disease-drug associations with 237 medications (represented as BNF subparagraphs) that survived statistical correction, most of which resembled drug-indication pairs.
Conclusion: Our methodology converts highly fragmented raw prescription records of inconsistent data quality into a streamlined, enriched dataset at a single reference standard, version, and granularity. Harmonised prescription records can then be readily used by researchers for large-scale analyses.
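The harmonisation chain described above (source ontology to dm+d to VMP) reduces to chained lookups with a fallback for unmappable codes. A toy sketch with invented codes and tables; real mappings come from the NHS dm+d, BNF, and Read2 releases:

```python
# Toy lookup tables with invented identifiers, for illustration only.
BNF_TO_DMD = {"0212000B0AAABAB": "dmd:319997"}
READ2_TO_DMD = {"bxd1.": "dmd:319997"}
DMD_TO_VMP = {"dmd:319997": "VMP:example statin 20mg tablets"}

def harmonise(record):
    """Map one raw prescription record to a VMP concept, or None.

    `record` is assumed to be a dict with `code` and `system` keys.
    """
    # Step 1: source ontology (BNF or Read2) -> dm+d concept.
    source_map = {"BNF": BNF_TO_DMD, "Read2": READ2_TO_DMD}.get(record["system"], {})
    dmd = source_map.get(record["code"])
    if dmd is None:
        return None  # unmappable: dropped or flagged for manual review
    # Step 2: homogenise dm+d concept to VMP granularity.
    return DMD_TO_VMP.get(dmd)
```

Note that both source ontologies converge on the same dm+d concept here, which is exactly what makes the harmonised dataset homogeneous at VMP level.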
Gombar, S.; Shah, N.; Sanghavi, N.; Coyle, J.; Mukerji, A.; Chappelka, M.
Background: The observational literature on comparative effectiveness is expanding rapidly but remains difficult to synthesize. Discordant findings often stem from structural differences in cohort definitions, inclusion criteria, and follow-up windows, leaving stakeholders without a cohesive evidence base. Furthermore, studies typically focus on a narrow subset of outcomes, neglecting the broader needs of diverse healthcare stakeholders [1-4]. Methods: We developed a high-throughput evidence generation workflow using linked EHR and administrative claims data. The cornerstone is a prespecified measurement architecture applied uniformly across clinical scenarios: six post-index windows (acute to two-year follow-up); 28 Elixhauser comorbidities; 14 healthcare resource utilization (HCRU) categories; 29 laboratory measures with 52 binary thresholds; and 42 adverse event categories. We generated unadjusted treatment comparisons across ~1,038 outcomes per scenario, including effect-measure modification (EMM) assessments across 130 baseline features. Results: Across 40 clinical domains, the workflow produced approximately 32,982,552 outcome evaluations, each comprising a treatment comparison, outcome, population, and effect estimate with uncertainty bounds and supporting diagnostics. Approximately 5,000 narrative summaries underwent structured clinical and statistical quality control before dissemination. Conclusions: Standardized, high-throughput workflows can shift evidence generation away from fragmented studies toward comprehensive evidence packages. This shared evidence base supports precision medicine by making treatment effect heterogeneity visible across clinically meaningful subpopulations, reducing the need for redundant, stakeholder-specific studies.
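A measurement architecture like the one above is essentially a cross product of follow-up windows and outcome definitions. A sketch of enumerating such a grid, with invented window names and only the four outcome families the abstract counts explicitly (which is why the toy total of 816 falls short of the study's ~1,038 outcomes per scenario):

```python
from itertools import product

# Invented window labels; the abstract specifies only "six post-index
# windows (acute to two-year follow-up)".
windows = ["acute", "90d", "180d", "1y", "18m", "2y"]

# Outcome families with the counts stated in the abstract.
outcomes = (
    [f"elixhauser_{i}" for i in range(28)]        # 28 comorbidities
    + [f"hcru_{i}" for i in range(14)]            # 14 utilization categories
    + [f"lab_threshold_{i}" for i in range(52)]   # 52 binary lab thresholds
    + [f"adverse_event_{i}" for i in range(42)]   # 42 adverse event categories
)

# Every (window, outcome) pair is one evaluation cell per scenario.
grid = list(product(windows, outcomes))
n_cells = len(grid)  # 6 windows x 136 outcome definitions = 816 cells
```

Applying a grid like this uniformly across 40 clinical domains, multiple treatment comparisons, and 130 effect-measure-modification strata is what drives the evaluation count into the tens of millions.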